The Kronos incident was also the challenge plot in VAST 2014. The 2014 questions were slightly different and focused on describing the incident timeline, so the majority of entrants used a timeline to visualize the progress of the incident. The question common to 2014 and 2021 concerns the connections between POK and GASTech.
Previous projects used tools such as Python, D3 and Visio to visualize each corpus in a grid with a search function. Network graphs were commonly used to visualize relationships between personnel in GASTech and POK. These approaches are effective and will be used in this assignment too, with the help of R packages.
There was no high-level analysis of the news sources, potentially because the questions focused on the incident rather than on the sources. To account for the difference in questions this year, word cloud plots (general, comparison and commonality) will be used to gain a high-level understanding of each news source.
packages = c('stringr', 'stringi', 'tidyr', 'tidyverse', 'dplyr', 'tm', 'lubridate',
             'corporaexplorer', 'quanteda', 'quanteda.textstats', "igraph",
             "visNetwork", "tidygraph", "ggraph", "networkD3", "stm",
             "tidytext", "widyr", "wordcloud",
             "ggwordcloud", "textplot")
for (p in packages){
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
Overview: In this task, all news articles are loaded and formatted into a data frame. After cleaning, cosine similarity is computed for each pair of articles. Only the top 0.5% of pairs with the highest similarity are analyzed here, which corresponds to a cosine similarity threshold of 0.54. This rests on the assumption that pairs of articles with cosine similarity above that value are strongly similar. Within each pair, the article published earlier is considered the primary source and the one published later the secondary source. This label forms a new column, “role”, in the data frame. Each article in the top 0.5% of pairs is also labeled with the IDs of the articles similar to it, both those published earlier and those published later. These new columns are named “similar_earlier” and “similar_later”, and they make visualization in the interactive corpus explorer easier. An interactive directed network graph is used to visualize the connections between sources. After the new columns are added, the corpus data frame is visualized in corpus explorer, which provides a search function. The steps and a demonstration are described below.
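To make the similarity measure concrete, here is a minimal base-R sketch of cosine similarity on toy term-count vectors. The helper `cosine_similarity` and the three toy documents are illustrative assumptions; the pipeline itself relies on `textstat_simil`.

```r
# Cosine similarity between two term-count vectors (hypothetical helper).
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy term counts over the same vocabulary (made-up numbers).
doc_a <- c(pok = 2, protest = 1, gastech = 0)
doc_b <- c(pok = 4, protest = 2, gastech = 0)  # same proportions as doc_a
doc_c <- c(pok = 0, protest = 0, gastech = 3)  # shares no terms with doc_a

sim_ab <- cosine_similarity(doc_a, doc_b)  # approximately 1: same direction
sim_ac <- cosine_similarity(doc_a, doc_c)  # 0: no overlapping terms
```

Because the measure depends on the direction of the count vectors and not on their length, a short repost and the longer article it was taken from can still score highly.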
1. Get the list of articles in all paper sources
list_of_paper <- list.files(path = "data/News Articles", recursive = TRUE,
pattern = ".",
full.names = TRUE)
2. Load each article iteratively and save the results into a list of data frames
df_list <- list()
num <- 1
for (i in list_of_paper){
  # read the file line by line
  temp <- lapply(i, readLines)
  temp <- lapply(1:length(temp),
                 function(j) data.frame(
                   # article number: the first run of digits that follows a "/" in the path
                   news_no=str_extract(i, "(?<=\\/)\\d+"),
                   rawdata=temp[[j]],
                   stringsAsFactors = FALSE))
  df_temp <- do.call(rbind, temp)
  # split each line at the first ":" into a field name and its value
  df_temp[,c("type","entry")] <-
    str_trim(str_split_fixed(df_temp$rawdata,":",2))
  df_temp <- df_temp[,c("news_no","type","entry")]
  # one row per article, one column per field
  df_temp <- pivot_wider(df_temp, names_from = type, values_from = entry)
  df_list[[num]] <- df_temp
  num <- num+1
}
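3. Bind all data frames in the list
# rbind.fill() comes from plyr, which is not in the package list above, so call it by namespace
df <- do.call(plyr::rbind.fill,df_list)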
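The loading and binding steps can be sketched on toy input as follows. This is a minimal illustration, not the post's exact code: the header lines, the helper `parse_article` and the field values are made up, and `dplyr::bind_rows` stands in for `rbind.fill` (both fill columns that are missing from some articles with `NA`).

```r
library(dplyr)
library(tidyr)

# Hypothetical header lines for two articles; the second has no TITLE field.
raw_a <- c("SOURCE: Example Paper A",
           "TITLE: EXAMPLE HEADLINE",
           "PUBLISHED: 2005/04/06")
raw_b <- c("SOURCE: Example Paper B",
           "PUBLISHED: 2014/01/20",
           "NOTE: updated at 10:30")

# Hypothetical helper: one article's lines -> one-row data frame.
parse_article <- function(lines, news_no) {
  tibble(news_no = news_no, rawdata = lines) %>%
    # split only at the first colon so values containing ":" stay intact
    separate(rawdata, into = c("type", "entry"), sep = ":", extra = "merge") %>%
    mutate(across(c(type, entry), trimws)) %>%
    pivot_wider(names_from = type, values_from = entry)
}

df_a <- parse_article(raw_a, "1")
df_b <- parse_article(raw_b, "2")

# Columns present in only one article are filled with NA in the other.
combined <- bind_rows(df_a, df_b)
```

Splitting at the first colon only matters for fields such as the hypothetical `NOTE` above, whose value itself contains a colon.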
1. Check data types: parse the publication date and convert the article number to an integer
formats <- c("%d %B %Y","%d%B %Y","%B %d, %Y","%Y/%m/%d")
df <- df %>%
  mutate(date = parse_date_time(PUBLISHED, formats))
df$date <- as.Date(as.POSIXct(df$date,tz="GMT"))
df$news_no <- as.integer(df$news_no)
2. Rename the columns and reorder them so that the data frame can be loaded as a corpus
3. This data frame will be used as the main corpus to visualize in corpus explorer. Stopwords and punctuation are not removed, so that the articles remain readable. Save the data.
write_rds(df_corpus, "data/rds/original_corpus")
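# build a quanteda corpus, then tokenize, drop punctuation and English stopwords,
# and convert to a document-feature matrix (dfm() on a corpus is deprecated in recent quanteda)
news_corpus <- corpus(df_corpus, docid_field="Doc_id", text_field="Text")
clean_corpus <- news_corpus %>%
  tokens(remove_punct=TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  dfm()
# pairwise cosine similarity between documents
sim <- textstat_simil(clean_corpus, margin="document", method="cosine")
sim_df <- as.data.frame(sim)
sim_df$document1 <- as.integer(as.character(sim_df$document1))
sim_df$document2 <- as.integer(as.character(sim_df$document2))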
2. Create a lookup table of sources for merging later
df_source <- df_corpus %>%
  select(Doc_id,Source,date)
sim_tmp <- sim_df %>%
  rowid_to_column(var="pair_id") %>%
  # keep the top 0.5% of pairs: the highest of 200 equal-sized buckets
  mutate(rank=ntile(cosine,200)) %>%
  filter(rank==200) %>%
  # attach source and date for both documents of each pair
  left_join(df_source, by=c("document1"="Doc_id"), suffix=c("1","2")) %>%
  left_join(df_source, by=c("document2"="Doc_id"), suffix=c("1","2")) %>%
  # drop pairs from the same source or published on the same day
  filter(Source1 != Source2) %>%
  filter(date1 != date2) %>%
  # reorder each pair so that document1/Source1 is the earlier article
  transform(document1=ifelse(date1<=date2, document1, document2),
            document2=ifelse(date1<=date2, document2, document1)) %>%
  transform(Source1=ifelse(date1<=date2, Source1, Source2),
            Source2=ifelse(date1<=date2, Source2, Source1))
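Two ideas in the pipeline above are worth seeing in isolation: keeping the top 0.5% of pairs with `ntile()`, and orienting each pair so that the earlier article comes first. The sketch below uses made-up scores and dates; the column names `primary` and `secondary` are illustrative, not the post's.

```r
library(dplyr)

# 1,000 made-up similarity scores: 0.001, 0.002, ..., 1
scores <- tibble(pair_id = 1:1000,
                 cosine = seq(0.001, 1, length.out = 1000))

# 200 equal-sized buckets of 5 pairs each; bucket 200 is the top 0.5%
top_pairs <- scores %>%
  mutate(rank = ntile(cosine, 200)) %>%
  filter(rank == 200)

# Toy pairs with publication dates for both documents
pairs <- tibble(
  document1 = c(1L, 2L, 3L),
  document2 = c(4L, 5L, 6L),
  date1 = as.Date(c("2014-01-20", "2014-01-21", "2014-01-19")),
  date2 = as.Date(c("2014-01-21", "2014-01-20", "2014-01-21"))
)

# The earlier article of each pair becomes "primary", the later one "secondary"
oriented <- pairs %>%
  mutate(
    primary   = if_else(date1 <= date2, document1, document2),
    secondary = if_else(date1 <= date2, document2, document1)
  )
```

The sketch writes to new columns because `mutate()` evaluates its arguments in order: overwriting `document1` first would corrupt the value used to compute `document2`. `transform()`, as used in the post, evaluates every expression against the original columns, so the in-place swap is safe there.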
df_primary <- sim_tmp %>%
  group_by(document1) %>%
  dplyr::summarise(n = n()) %>%
  mutate(role="primary")
df_secondary <- sim_tmp %>%
  group_by(document2) %>%
  dplyr::summarise(n = n()) %>%
  mutate(role="secondary")
#join the new column to df_corpus
df_join <- df_corpus %>%
  left_join(df_primary, by=c("Doc_id"="document1"), suffix=c("_pri","_sec")) %>%
  left_join(df_secondary, by=c("Doc_id"="document2"), suffix=c("_pri","_sec")) %>%
  unite("role", n_pri, role_pri, n_sec, role_sec, sep="_",na.rm=TRUE)
glimpse(df_join)
Rows: 845
Columns: 9
$ Doc_id <int> 121, 135, 152, 154, 237, 251, 341, 391, 420, 554, …
$ Text <chr> "Fifteen members of the Protectors of Kronos (POK)…
$ Source <chr> "All News Today", "All News Today", "All News Toda…
$ Title <chr> "POK PROTESTS END IN ARRESTS", "RALLY SCHEDULED IN…
$ Author <chr> NA, NA, NA, "Petrus Gerhard", NA, NA, NA, NA, NA, …
$ date <date> 2005-04-06, 2012-04-09, 1993-02-02, 1998-03-20, 1…
$ Location <chr> "ELODIS, Kronos", "ABILA, Kronos", "ABILA, Kronos"…
$ Note <chr> NA, NA, NA, "This article is the first in a series…
$ role <chr> "1_primary_1_secondary", "4_primary", "3_primary",…
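The way the “role” label is assembled can be reproduced on a toy table. This is a minimal sketch with made-up counts: `unite()` with `na.rm = TRUE` drops the missing pieces, so an article that only ever appears as primary gets a label such as "4_primary".

```r
library(dplyr)
library(tidyr)

# Made-up counts: doc 1 is both primary and secondary, doc 2 only primary,
# doc 3 only secondary.
roles <- tibble(
  Doc_id   = c(1L, 2L, 3L),
  n_pri    = c(1L, 4L, NA),
  role_pri = c("primary", "primary", NA),
  n_sec    = c(1L, NA, 2L),
  role_sec = c("secondary", NA, "secondary")
)

labelled <- roles %>%
  unite("role", n_pri, role_pri, n_sec, role_sec, sep = "_", na.rm = TRUE)
```

Searching the resulting column for "primary" or "secondary" then finds every article that played that role at least once.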
pair_pri <- sim_tmp %>%
  select(cosine,pair_id,document1,document2) %>%
  group_by(document1) %>%
  arrange(document1,desc(cosine)) %>%
  dplyr::summarize(similar_later=paste(document2, collapse="_"))
pair_sec <- sim_tmp %>%
  select(cosine,pair_id,document1,document2) %>%
  group_by(document2) %>%
  arrange(document2,desc(cosine)) %>%
  dplyr::summarize(similar_earlier=paste(document1, collapse="_"))
#join the column to df_join
df_join <- df_join %>%
  left_join(pair_pri, by=c("Doc_id"="document1")) %>%
  left_join(pair_sec, by=c("Doc_id"="document2"))
glimpse(df_join)
Rows: 845
Columns: 11
$ Doc_id <int> 121, 135, 152, 154, 237, 251, 341, 391, 420…
$ Text <chr> "Fifteen members of the Protectors of Krono…
$ Source <chr> "All News Today", "All News Today", "All Ne…
$ Title <chr> "POK PROTESTS END IN ARRESTS", "RALLY SCHED…
$ Author <chr> NA, NA, NA, "Petrus Gerhard", NA, NA, NA, N…
$ date <date> 2005-04-06, 2012-04-09, 1993-02-02, 1998-0…
$ Location <chr> "ELODIS, Kronos", "ABILA, Kronos", "ABILA, …
$ Note <chr> NA, NA, NA, "This article is the first in a…
$ role <chr> "1_primary_1_secondary", "4_primary", "3_pr…
$ similar_later <chr> "760", "414_239_207_773", "639_210_748", "3…
$ similar_earlier <chr> "221", NA, NA, "69", NA, "688_297", "290", …
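The ID strings in “similar_later” and “similar_earlier” are ordered from most to least similar. A minimal sketch of that collapsing step on made-up pairs:

```r
library(dplyr)

# Made-up pairs: document 10 has three later look-alikes, document 20 has one.
pairs <- tibble(
  document1 = c(10L, 10L, 10L, 20L),
  document2 = c(31L, 32L, 33L, 31L),
  cosine    = c(0.60, 0.90, 0.75, 0.58)
)

# Sort by similarity within each earlier document, then glue the later IDs together.
similar_later <- pairs %>%
  arrange(document1, desc(cosine)) %>%
  group_by(document1) %>%
  summarise(similar_later = paste(document2, collapse = "_"))
```

Here document 10 ends up with the string "32_33_31", so the first ID listed is always the closest match.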
Corpus explorer is used to view each document together with its metadata: doc_id, title, author, date, location, role, similar documents published later, and similar documents published earlier. Its main use is keyword search.
The shiny app is first deployed to shinyapps.io and then embedded in the R markdown. For full view, please click here to view the app in a browser.
Searching for “primary” and “secondary” in the “role” column created earlier shows 5 sources that stand out, with significantly more articles in the primary role than in the secondary role. These are identified as the primary sources.
#### Interactive network graph
A network graph is used to visualize the clusters surrounding these sources. Nodes and edges are prepared for all sources so that the neighboring nodes of the primary sources can be identified, and the nodes are then labeled with groups.
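The node and edge preparation is not shown in the post, so the following is only a sketch of one way to do it, assuming a table of oriented pairs with columns `Source1` (earlier), `Source2` (later) and `cosine`. The source names and scores are made up, and the `from`/`to` column names follow the convention visNetwork expects.

```r
library(dplyr)
library(tibble)

# Made-up oriented pairs: Source1 published first, Source2 later.
pairs <- tibble(
  Source1 = c("Paper A", "Paper A", "Paper A", "Paper B"),
  Source2 = c("Paper C", "Paper C", "Paper D", "Paper C"),
  cosine  = c(0.60, 0.80, 0.70, 0.90)
)

# One node per source, with an integer id.
nodes <- tibble(label = sort(unique(c(pairs$Source1, pairs$Source2)))) %>%
  rowid_to_column("id")

# One directed edge per source pair: weight = number of similar articles,
# mean_cosine = average similarity (the quantity a heatmap could show).
edges <- pairs %>%
  group_by(Source1, Source2) %>%
  summarise(weight = n(), mean_cosine = mean(cosine), .groups = "drop") %>%
  left_join(nodes, by = c("Source1" = "label")) %>%
  rename(from = id) %>%
  left_join(nodes, by = c("Source2" = "label")) %>%
  rename(to = id)
```

Mapping `weight` to the edge width then makes heavily reused sources visible as thick outgoing arrows.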
There are 5 primary news sources, namely “Homeland Illumination”, “International Times”, “The Abila Post”, “Kronos Star” and “The World”. Each sits at the center of a community in the above graph, with outward-pointing arrows. Arrows run from primary to secondary sources, and the width of each edge indicates the number of similar articles between the two. The heatmap above plots the average cosine similarity for each pair of sources. The highest values are seen in “Homeland Illumination-All News Today”, “International Times-World Source”, “Kronos Star-International News”, “The Abila Post-Central Bulletin”, and “The World-Who What News”. This corroborates that the 5 sources are indeed primary sources. The other news sources mostly repost from them or base their reports on them.
Note also that 3 sources are rather independent: “Tethys News”, “Centrum Sentinel” and “Modern Rubicon”. None of their articles is paired with another article at a similarity above 0.54.
Another source to pay attention to is “News Online Today”. It has the largest number of articles, 111. Its content is similar to that of all the primary sources: the edges pointing to “News Online Today” have weights between 10 and 22. Because it reports from a mix of sources, it is in a position to offer a balanced point of view.
Overview: The corpus explorer from Task 1 is used again to view each text. By searching for the key terms “POK”, “APA”, “government”, and “GASTech”, we aim to find out whether each news source holds any bias towards them. Additionally, a word cloud is plotted for each source to find frequently occurring words. The analysis is conducted by the news source clusters found in the previous task.
From the background information, we understand that POK (Protectors of Kronos) was established to advocate environmental protection in Kronos. A key incident in the growth of POK was the death of Juliana Vann, a 10-year-old girl who died from leukemia that resulted, directly or indirectly, from water pollution. The GASTech factory in Kronos was the main contributor to that pollution.
In general, POK is against GASTech, and the Kronos government did little to help with the water pollution. Because POK is clearly on the opposite side from the government and GASTech, in exploring the corpus we pick out 2 kinds of news:
By analysing these texts, we aim to find opinions held by news sources.
The shiny app is the one from the previous section. Search for POK (in red), government (in blue) and gastech (in green) to highlight them in the corpus.
This shows the distribution of occurrences of the keywords in each news source.
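The counts behind such a distribution can be computed directly from the text. The sketch below is one possible way, using made-up articles and case-insensitive whole-word matching; it is not the code behind the plot in the post.

```r
library(dplyr)
library(stringr)

# Made-up articles from two hypothetical sources.
articles <- tibble(
  Source = c("Paper A", "Paper A", "Paper B"),
  Text = c("POK rally against GAStech.",
           "The government met GAStech; GASTech declined comment.",
           "Government statement on the POK.")
)

# Lower-case the text once, then count whole-word matches per source.
keyword_counts <- articles %>%
  mutate(text_lower = str_to_lower(Text)) %>%
  group_by(Source) %>%
  summarise(
    pok        = sum(str_count(text_lower, "\\bpok\\b")),
    government = sum(str_count(text_lower, "\\bgovernment\\b")),
    gastech    = sum(str_count(text_lower, "\\bgastech\\b"))
  )
```

Lower-casing first matters here because the corpus spells the company name in several ways (GASTech, GAStech, Gastech).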
From the plot, POK is barely mentioned in Centrum Sentinel, Modern Rubicon, Tethys News, The General Post, The Light of Truth, The Tulip, The World, and Who What News. These sources tend to report more on GASTech, with some news on the government. From the previous network visualization, we can also see that they belong exclusively to 2 groups: the independent sources circled in red, or the news sources surrounding The World circled in purple.
From corpus explorer and the datatable below, we can conclude that Centrum Sentinel, Modern Rubicon and Tethys News post timestamped updates on the kidnapping and focus only on reporting the incident. Their posts span 2 days, from Jan 20 to Jan 21, 2014. Next, we obtain details by searching for pok in the datatable.
It can be concluded that the article in Modern Rubicon implies that POK is the kidnapper and demanded a ransom of 20 million dollars. “1245 - I REDEEM DEMANDS FROM POK - the protections of Kronos has freed a supporting responsibility of declaration of the kidnapping of employs you of GAStech that they demand I redeem $20 million”
The only article mentioning POK is from Petrus Gerhard, who writes that POK is trying to communicate without hindrance from the government. This suggests a neutral to pro-POK stance on the part of the paper. “The POK is forced in order to go to the great lengths in order to communicate with you without obstacle from the government”
In Tethys News, the articles report the facts as they happen and take a rather neutral stance. “We have not received the information in a position to confirming the role of the POK in this kidnapping declared, but we notice that they are develops more and more violent in these last 5 years.”
As the other 5 sources share a centroid, “The World”, a word cloud is plotted to visualize the topics involved in their texts.
The outstanding keywords are “Gastech”, “Government”, “International”, “Kronos” and “Sanjorge”. These are neutral words, indicating that the sources clustered with The World discuss news about Gastech and the government in a neutral tone. Additionally, this is a business-oriented group of news sources with less focus on politics. “Contamination” is a frequent word too. It is associated with POK’s accusations of water pollution. However, Juliana Vann’s death, a major incident in POK’s history that is related to water pollution, is not mentioned here. This indicates that The World is not telling stories from POK’s point of view. Instead, it takes a rather neutral stance.
Next, we filter for documents containing the keywords “protest” and “rally”. The keywords Juliana, corruption and support are assumed to be used more often by POK-inclined sources, as these are the usual topics of POK’s activities. On the other hand, terror, violence and vandalism are considered more government-inclined, as this is usually how the government sees POK.
From the above, we can draw the preliminary conclusion that Homeland Illumination is biased towards POK and Kronos Star is more inclined to the government. A comparison word cloud and a commonality word cloud are plotted to further analyze these two primary news sources.
From the comparison, we can conclude that Kronos Star is indeed more a voice of the government, with many mentions of police; president; Kapelou, who is the new Kronos President; and building, which refers to the Abila Capital Building where protests happened. In Homeland Illumination, by contrast, the keywords are elian, henk, karel and bodrogi, who are leaders of POK; and death, water, wellness and river, which are the main motives of POK. Note also that in the comparison graph Kronos Star mentions violence and security, which do not appear in Homeland Illumination.
Both mention gastech, kronos, pok, government and health. This means that the 2 sources are likely to discuss the same topics but from different perspectives. Further supporting evidence can be checked with the interactive data table.
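Both kinds of word cloud start from the same input, a term matrix with one row per word and one column per source. The sketch below builds such a matrix in base R from two made-up texts and lists the shared and distinctive words; the texts and object names are assumptions. The final calls use `comparison.cloud()` and `commonality.cloud()` from the wordcloud package and only run in an interactive session where that package is installed.

```r
# Made-up texts standing in for two sources.
texts <- c(
  source_a = "police president security protest government gastech",
  source_b = "water river death protest government gastech"
)

tokens <- strsplit(texts, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Term matrix: rows are words, columns are sources, cells are counts.
term_matrix <- sapply(tokens, function(tok) {
  counts <- table(factor(tok, levels = vocab))
  setNames(as.integer(counts), vocab)
})

# Words used by every source (what a commonality cloud draws on) ...
shared_words <- rownames(term_matrix)[rowSums(term_matrix > 0) == ncol(term_matrix)]
# ... and words used only by source_a (what a comparison cloud emphasizes).
only_a <- rownames(term_matrix)[term_matrix[, "source_a"] > 0 &
                                  term_matrix[, "source_b"] == 0]

# Plot only when run interactively with wordcloud available.
if (interactive() && requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::comparison.cloud(term_matrix, max.words = 50)
  wordcloud::commonality.cloud(term_matrix, max.words = 50)
}
```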
As POK is against GASTech, we are keen to find out how the two sources view POK’s celebration on 21st Jan, by searching for celebrate in Text and limiting Source to Homeland Illumination and Kronos Star.
The above supports the point that Homeland Illumination is a voice of POK, while Kronos Star is more inclined to the government and to government-GASTech collaboration.
Another incident that differentiates the views of the sources is Elian Karel’s death. A breakdown of the sentences shows that:
+ Homeland Illumination takes a supportive role towards POK. Sentences such as “Elian Karel, age 28, died Friday while unlawfully incarcerated at the Abila City Prison” in one of the articles show sympathy for the POK leader’s death in prison.
+ Kronos Star described it as “Elian Karel, who died of natural causes in jail last year”, which shows no sympathy.
+ The Abila Post is in favor of POK on Elian’s death, portraying him as “The POK’s second major martyr is Elian Karel, who died in 2009 while incarcerated of undetermined causes”.
+ International Times takes a neutral stance on Elian’s death and describes him as “Voice of protest for some, popularist demagogue for others, Protectors of Kronos leader Elian Karel became a martyr and a rallying point for some; a figure of disdain and rabblerousing for others, and a political concern to the Kronosian government”.
Comparison and commonality word clouds between The Abila Post and International Times
Even though both report news on gastech, kronos, government, pok and GASTech CEO Sanjorge, GASTech and Kronos are mentioned significantly more often in The Abila Post than in the International Times. Kapelou, the Kronos President, also appears frequently in The Abila Post. This is probably because The Abila Post is a local paper while International Times reports on global issues, as their names suggest.
News sources are analyzed by cluster. The bias is annotated and grouped.
Install chorddiag and load the library. A chord diagram will be used to analyze email conversations between employees.
devtools::install_github("mattflor/chorddiag")
library(chorddiag)
Read in the employee records to form the nodes. Convert the date columns to date format. A new column “Name” is created to replace FirstName and LastName. Column “Age” is computed from BirthDate as the age in 2014, when the kidnapping happened.
employee_records <- read_csv("data/EmployeeRecords.csv")
employee_records <- employee_records %>%
  mutate(Name=paste(FirstName, LastName, sep=".")) %>%
  transform(Name=sub(" ", ".", Name)) %>%
  transform(BirthDate=parse_date_time(BirthDate,"%d/%m/%y"),
            CitizenshipStartDate=parse_date_time(CitizenshipStartDate, "%d/%m/%y"),
            PassportIssueDate=parse_date_time(PassportIssueDate, "%d/%m/%y"),
            PassportExpirationDate=parse_date_time(PassportExpirationDate,
                                                   "%d/%m/%y"),
            CurrentEmploymentStartDate=parse_date_time(CurrentEmploymentStartDate,
                                                       "%d/%m/%y"),
            MilitaryDischargeDate=year(parse_date_time(MilitaryDischargeDate,
                                                       "%d/%m/%y"))) %>%
  mutate(Age=ifelse(year(BirthDate)>2014,
                    2014-year(BirthDate)+100, 2014-year(BirthDate))) %>%
  select(Name, Age,everything(), -FirstName, -LastName, -BirthDate)
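The `+100` correction in the Age calculation exists because the dates carry two-digit years. A minimal base-R sketch with made-up birth dates shows the problem and the fix; base R's `%y` reads 00 to 68 as 20xx and 69 to 99 as 19xx, so someone born in 1949 is parsed as born in 2049.

```r
# Made-up birth dates in day/month/two-digit-year form.
birth_raw <- c("13/05/49", "01/02/85")

birth_date <- as.Date(birth_raw, format = "%d/%m/%y")
birth_year <- as.integer(format(birth_date, "%Y"))  # 2049 and 1985

# Any birth year after the incident year must belong to the previous century.
incident_year <- 2014L
age <- ifelse(birth_year > incident_year,
              incident_year - birth_year + 100L,
              incident_year - birth_year)
```

The result is a difference in calendar years, so it can be off by one for employees whose birthday falls later in the year than the incident.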
Make row id a new column “id” as the node id. Name is the node label. Rename some columns for easier understanding.
employee_nodes <- employee_records %>%
  select(Name,Gender,Age,CurrentEmploymentType,CurrentEmploymentTitle,
         CitizenshipCountry) %>%
  rename(label=Name, Department=CurrentEmploymentType,
         Title=CurrentEmploymentTitle, Country=CitizenshipCountry) %>%
  arrange(label) %>%
  rowid_to_column("id")
Load the edge list from the email headers. Separate SentDate and SentTime. Because all emails were sent in January 2014, an additional column is created for the day of the month. This will be used to visualize the number of emails sent per day by each sender, bearing in mind that one email may be sent to multiple recipients. Another column, “Type”, is created to indicate whether the email is a direct email or a reply to another email.
employee_email <- read.csv("data/email headers.csv", encoding="UTF-8")
employee_email_agg <- employee_email %>%
  transform(Subject=stringi::stri_enc_toascii(Subject)) %>%
  # one row per recipient
  separate_rows(To, sep=",") %>%
  # keep only the part of each address before the "@"
  separate(From, c("From","FromEmail"), sep="@") %>%
  separate(To, c("To","ToEmail"), sep="@") %>%
  mutate(To=str_trim(To)) %>%
  transform(Date=parse_date_time(Date, c("mdy_hm","mdy"))) %>%
  transform(SentDate=date(Date),
            SentTime=format(Date, format="%H:%M")) %>%
  # subjects containing "RE:" are treated as replies
  mutate(Type=ifelse(str_detect(Subject, "RE:"), "reply","direct")) %>%
  select(From, To, SentDate, SentTime, Type, Subject) %>%
  rename(Source=From, Target=To) %>%
  transform(Source=sub(" ", ".", Source),
            Target=sub(" ", ".", Target))
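The core of this step, turning one email with several recipients into one edge per recipient and flagging replies, can be sketched on toy headers. The addresses and subjects below are made up, and the anchored pattern `^RE:` is a stricter variant of the check used above, which matches "RE:" anywhere in the subject.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up headers: one email to two recipients, and one reply.
headers <- tibble(
  From = c("Ann.Example@example.com", "Bob.Example@example.com"),
  To = c("Bob.Example@example.com, Cy.Example@example.com",
         "Ann.Example@example.com"),
  Subject = c("Lunch plans", "RE: Lunch plans")
)

edges <- headers %>%
  separate_rows(To, sep = ",") %>%                    # one row per recipient
  mutate(
    Source = str_remove(From, "@.*$"),                # drop the mail domain
    Target = str_remove(str_trim(To), "@.*$"),
    Type = if_else(str_detect(Subject, "^RE:"), "reply", "direct")
  ) %>%
  select(Source, Target, Type, Subject)
```

Expanding recipients first means that a single email to many people counts as many edges, which is why company-wide mailings need separate handling.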
Manipulate the table to detect any employee who emails all employees, i.e. sends to 53 recipients excluding himself or herself. Group the records by source, day, type and subject. The table below indicates that only 2 people send to the whole company, Mat Bramar and Ruscella Mies Haber. They both work in the administration department. Mat is assistant to the CEO. From the subjects, we can conclude that he regularly sends announcements, reminders, and matters that relate to the whole company, such as “All staff announcement” and “Changes to travel policy”.
Ruscella is assistant to the engineering group manager. She sends only 2 types of emails to the whole company, “Daily morning announcements” and “Good morning, GasTech!” Both are daily news updates for the whole company to read.
Next, we remove the exchanges that were sent to all employees so that the remaining emails are easier to interpret. This includes both direct emails and reply emails sent to all.
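The detection and removal of company-wide emails is not shown in the post. One way it could be done is sketched below, on a made-up company of 4 people so that a broadcast has 3 recipients; the names, subjects and the `n_employees` variable are assumptions.

```r
library(dplyr)

# Made-up edge list: Ann emails everyone else, Bob and Cy exchange one thread with Ann.
edges <- tibble(
  Source = c("Ann", "Ann", "Ann", "Bob", "Cy"),
  Target = c("Bob", "Cy", "Dee", "Ann", "Ann"),
  SentDate = as.Date("2014-01-06"),
  Subject = c(rep("All staff announcement", 3), "Lunch", "RE: Lunch")
)
n_employees <- 4  # Ann, Bob, Cy and Dee in this toy example

# An email is a broadcast if it reaches everyone except the sender.
broadcasts <- edges %>%
  group_by(Source, SentDate, Subject) %>%
  summarise(n_recipients = n_distinct(Target), .groups = "drop") %>%
  filter(n_recipients == n_employees - 1)

# Drop every edge that belongs to a broadcast.
edges_filtered <- edges %>%
  anti_join(broadcasts, by = c("Source", "SentDate", "Subject"))
```

Because the rule only looks at the number of recipients, a reply sent to the whole company is caught in the same way as a direct announcement.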
Prepare nodes and edges table for visNetwork.
employee_nodes <- employee_nodes %>%
  #add tooltip column
  mutate(title=paste("<p>",label,Gender,Age,"</br>",
                     "<br>",Department, Title,"</br></p>",sep=" ")) %>%
  rename(group=Department)
visNetwork(employee_nodes, employee_edges) %>%
  visEdges(arrows = 'to') %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visIgraphLayout(layout="layout_with_fr")
Visualize together with interactive table of all emails.
From the background information, we know that the founding leaders of POK are Henk Bodrogi, Jeroen Karel, Carmine Osvaldo and the current leaders are Elian Karel, Silvia Marek, Mandor Vann, Isia Vann, Lucio Jakab, Lorenzo Di Stefano, Valentine Mies, Yanick Cato and Joreto Katell.
There are 7 employees with the same last name or middle name as the POK leaders. 5 of them are in security and 1 of them is a truck driver. This combination would make it convenient for them to bring suspicious people in or to sneak people out.
name <- "Bodrogi|Karel|Osvaldo|Marek|Vann|Jakab|Stefano|Mies|Cato|Katell"
employee_nodes_suspect <- employee_nodes %>%
  filter(str_detect(label, name))
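A plain alternation pattern matches a surname anywhere inside the label, including inside a longer name. The base-R sketch below shows the difference between that and a stricter variant that only matches whole name parts; the label "Ada.Vannier" is a made-up example and the stricter pattern is an optional refinement, not what the post uses.

```r
name <- "Bodrogi|Karel|Osvaldo|Marek|Vann|Jakab|Stefano|Mies|Cato|Katell"

# Labels in First.Last form; "Ada.Vannier" is hypothetical.
labels <- c("Isia.Vann", "Ruscella.Mies.Haber", "Mat.Bramar", "Ada.Vannier")

# Substring match: also flags "Vannier" because it contains "Vann".
loose <- grepl(name, labels)

# Whole-part match: the surname must sit between dots or at the start/end of the label.
strict_pattern <- paste0("(^|\\.)(", name, ")(\\.|$)")
strict <- grepl(strict_pattern, labels)
```

With only a few dozen employees the loose match is easy to check by eye, so the stricter pattern mainly matters if the list of names grows.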
employee_email_agg2 %>%
  filter(Source %in% c("Isia.Vann", "Rachel.Pantanal"),
         Target %in% c("Isia.Vann", "Rachel.Pantanal")) %>%
  DT::datatable(filter="top", caption= htmltools::tags$caption(
                  style = 'caption-side: top; text-align: left;',
                  htmltools::em("Emails between Isia Vann and Rachel Pantanal")),
                rownames=FALSE,
                options = list(pageLength = 5, autoWidth = FALSE))